From DNA sequence analysis to modelling replication in the human genome 
We explore the large-scale behavior of nucleotide compositional strand asymmetries along human
chromosomes. As we observe for 7 of 9 origins of replication experimentally identified so far, the
(TA + GC) skew displays rather sharp upward jumps, with a linearly decreasing profile in between
two successive jumps. We present a model of replication with well positioned replication origins
and random terminations that accounts for the observed characteristic serrated skew profiles. We
succeed in identifying 287 pairs of putative adjacent replication origins with an origin spacing ~ 1-2 Mbp,
that are likely to correspond to replication foci observed in interphase nuclei and recognized
as stable structures that persist throughout subsequent cell generations.

PACS numbers: 87.15.Cc, 87.16.Sr, 87.15.Aa 



DNA replication is an essential genomic function re- 
sponsible for the accurate transmission of genetic infor- 
mation through successive cell generations. According 
to the "replicon" paradigm derived from prokaryotes [1],
this process starts with the binding of some "initiator"
protein to a specific "replicator" DNA sequence called the
origin of replication (ori). The recruitment of additional
factors initiates the bidirectional progression of two
divergent replication forks along the chromosome. One 
strand is replicated continuously from the origin (lead- 
ing strand), while the other strand is replicated in dis- 
crete steps towards the origin (lagging strand). In eu- 
karyotic cells, this event is initiated at a number of ori 
and propagates until two converging forks collide at a 
terminus of replication (ter). The initiation of different
ori is coupled to the cell cycle, but there is a definite
flexibility in the usage of the ori at different developmental
stages. It can also be strongly influenced
by the distance and timing of activation of neighbouring
ori, by the transcriptional activity and by the local
chromatin structure. In fact, sequence requirements
for an ori vary significantly between different eukaryotic 
organisms. In the unicellular eukaryote Saccharomyces 
cerevisiae, the ori spread over 100-150 bp and present 
some highly conserved motifs. In the fission yeast
Schizosaccharomyces pombe, there is no clear consensus
sequence and the ori spread over at least 800 to 1000
bp. In multi-cellular organisms, the ori are rather
poorly defined and initiation may occur at multiple sites
distributed over thousands of base pairs. Actually, cell
diversification may have led higher eukaryotes to develop 
various epigenetic controls over the ori selection rather 
than to conserve specific replicator sequences [1]. This 
might explain why only very few ori have been identified
so far in multi-cellular eukaryotes, namely around 20 in
metazoa and only about 10 in human. The aim of the
present work is to show that, with an appropriate coding
and an adequate methodology, one can address the issue
of detecting putative ori directly from the genomic
sequences.

According to the second parity rule, under no-strand-bias
conditions, each genomic DNA strand should
present equimolarities of A and T and of G and C. Devi- 
ations from intrastrand equimolarities have been exten- 
sively studied in prokaryotic, organelle and viral genomes 
for which they have been used to detect the ori. Indeed,
the GC and TA skews abruptly switch sign at the
ori and ter, displaying step-like profiles, such that the
leading strand is generally richer in G than in C, and to 
a lesser extent in T than in A. During replication, muta- 
tional events can affect the leading and lagging strands 
differently, and an asymmetry can result if one strand in- 
corporates more mutations of a particular type or if one 
strand is more efficiently repaired. In eukaryotes, the
existence of compositional biases has been debated and 
most attempts to detect the ori from strand composi- 
tional asymmetry have been inconclusive. In primates, 
a comparative study of the β-globin ori has failed to
reveal the existence of a replication-coupled mutational
bias [13]. Other studies have led to rather opposite results.
The analysis of the yeast genome presents clear
replication-coupled strand asymmetries in subtelomeric 
chromosomal regions [11]. A recent space-scale analysis
of the GC and TA skews in Mbp-long human contigs
has revealed the existence of compositional strand asymmetries
in intergenic regions, suggesting the existence of a
replication bias. Here, we show that the (TA + GC) skew
profiles of the 22 human autosomal chromosomes display
a remarkable serrated "factory roof"-like behavior that
differs from the crenelated "castle rampart"-like profiles
resulting from the prokaryotic replicon model [9]. This
observation will lead us to propose an alternative model 
of replication in higher eukaryotes. 

Sequences and gene annotation data were downloaded 
from the UCSC Genome Bioinformatics site and corre- 
spond to the assembly of July 2003 of the human genome. 
To exclude repetitive elements that might have been 
inserted recently and would not reflect long-term evo- 
lutionary patterns, we used the repeat-masked version 
of the genome, leading to a homogeneous reduction of
40-50% of sequence length. All analyses were carried
out using "knowngene" gene annotations.

FIG. 1: S = S_TA + S_GC vs the position n in the repeat-masked
sequences, in regions surrounding 3 known human
ori (vertical bars): (a) MCM4 (native position 48.9 Mbp in
chr. 8 [7(b)]); (b) c-myc (nat. pos. 128.7 Mbp in chr. 8
[7(a)]); (c) TOP1 (nat. pos. 40.3 Mbp in chr. 20 [7(c)]).
The values of S_TA and S_GC were calculated in adjacent
1 kbp windows. The dark (light) grey dots refer to "sense"
("antisense") genes with coding strand identical (opposed)
to the sequence; black dots correspond to intergenes.

The TA and GC skews were calculated as S_TA = (T - A)/(T + A) and
S_GC = (G - C)/(G + C). Here, we will mainly consider
S = S_TA + S_GC, since by adding the two skews, the sharp
transitions of interest are significantly amplified.
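
As an illustration of the skew definitions above, the following minimal Python sketch computes S_TA, S_GC and S in adjacent windows. It is not the code used for the analysis: the function name `window_skews`, the default window of 1000 bases, and the convention of returning a zero skew when a window holds no T/A (or no G/C) are assumptions made for this example, and the input is assumed to be a sequence from which repeats have already been removed.

```python
def window_skews(seq, window=1000):
    """Return a list of (S_TA, S_GC, S) tuples, one per adjacent window.

    Assumptions of this sketch: `seq` is a plain string of bases with
    repeats already removed, trailing bases that do not fill a complete
    window are ignored, and a skew is set to 0.0 when its denominator
    vanishes.
    """
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        a, t, g, c = (chunk.count(base) for base in "ATGC")
        s_ta = (t - a) / (t + a) if (t + a) else 0.0
        s_gc = (g - c) / (g + c) if (g + c) else 0.0
        results.append((s_ta, s_gc, s_ta + s_gc))
    return results
```

For instance, a single 8-base window "TTTAGGGC" contains T = 3, A = 1, G = 3, C = 1, so that S_TA = S_GC = 0.5 and S = 1.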

Fig. 1 shows the skew S profiles of 3 fragments
of chromosomes 8 and 20 that contain 3 experimentally
identified ori. As commonly observed for eubacterial
genomes [9], these 3 ori correspond to rather sharp
(over several kbp) transitions from negative to positive 
S values that clearly emerge from the noisy background. 
The leading strand is relatively enriched in T over A and 
in G over C. The investigation of 6 other known human
ori confirms the above observation for at least 4 of
them (the 2 exceptions, namely the Lamin B2 and β-globin
ori, might well be inactive in germline cells or
less frequently used than the adjacent ori). According 
to the gene environment, the amplitude of the jump can 
be more or less important and its position more or less 
localized (from a few kbp to a few tens of kbp). Indeed, it
is known that transcription generates positive TA and
GC skews on the coding strand [13], which explains why
larger jumps are observed when the sense genes are
on the leading strand and/or the antisense genes on the
lagging strand, so that replication and transcription biases
add to each other. In contrast to the characteristic
step-like replicon profile observed for eubacteria,
S is definitely not constant on each side of the ori location,
which makes the detection of the ter quite elusive since
no corresponding downward jumps of similar amplitude
can be found in Fig. 1.

FIG. 2: S = S_TA + S_GC skew profiles in 9 Mbp repeat-masked
fragments in the human chromosomes 9 (a), 14 (b) and 21
(c). Qualitatively similar but less spectacular serrated S
profiles are obtained with the native human sequences.

Fig. 2 shows the S profiles of long fragments of
chromosomes 9, 14 and 21, that are typical of a fair proportion
of the S profiles observed for each chromosome.
Sharp upward jumps of amplitude (ΔS ≈ 0.2) similar to
the ones observed for the known ori in Fig. 1 seem to
exist also at many other locations along the human chromosomes.
But the most striking feature is the fact that,
in between two neighboring major upward jumps, not
only does the noisy S profile not present any comparable
sharp downward transition, but it displays a remarkable
linearly decreasing behavior. At chromosome scale, one
thus gets jagged S profiles that have the aspect of "factory
roofs" rather than the "castle rampart" step-like profiles
expected for the prokaryotic replicon model.
The S profiles in Fig. 2 look somewhat disordered because
of the extreme variability in the distance between two
successive upward jumps, from spacings ~ 50-100 kbp
(~ 100-200 kbp for the native sequences) up to 2-3 Mbp
(~ 4-5 Mbp for the native sequences), in agreement with
recent experimental studies that have shown that mammalian
replicons are heterogeneous in size with an average
size ≈ 500 kbp, the largest ones being as large
as a few Mbp. We report in Fig. 3 the results of
a systematic detection of upward and downward jumps
using the wavelet-transform (WT) based methodology
described in Ref. [12(b)]. The selection criterion was
to retain only the jumps corresponding to discontinu- 
ities in the S profile that can still be detected with the 
WT microscope up to the scale 200 kbp which is smaller 
than the typical replicon size and larger than the typ- 
ical gene size. In this way, we reduce the contribution 
of jumps associated with transcription only and maintain
a good sensitivity to replication-induced jumps.

FIG. 3: Statistical analysis of the sharp jumps detected in
the S profiles of the 22 human autosomal chromosomes by
the WT microscope at scale a = 200 kbp for repeat-masked
sequences [12(b)]. |ΔS| = |S̄(3') - S̄(5')|, where the averages
were computed over the two adjacent 20 kbp windows
respectively in the 3' and 5' direction from the detected
jump location. (a) Histograms N(|ΔS|) of |ΔS| values. (b)
N(|ΔS| > ΔS*) vs ΔS*. In (a) and (b), the solid (resp.
thin) line corresponds to downward ΔS < 0 (resp. upward
ΔS > 0) jumps.

A set of 5100 jumps was detected (with, as generally expected,
an almost equal proportion of upward and downward
jumps). In Fig. 3(a) are reported the histograms of
the amplitude |ΔS| of the so-identified upward (ΔS > 0)
and downward (ΔS < 0) jumps respectively, for the
repeat-masked sequences. These histograms do not
superimpose, the former being significantly shifted to larger
|ΔS| values. When plotting N(|ΔS| > ΔS*) vs ΔS* in
Fig. 3(b), one can see that the number of large-amplitude
upward jumps exceeds the number of large-amplitude
downward jumps. These results confirm that most of the
sharp upward transitions in the S profiles in Figs 1 and
2 have no sharp downward transition counterpart. This
indicates that these jagged S profiles are likely to be
representative of a general asymmetry in the skew profile
behavior along the human chromosomes.
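
The jump statistics discussed above can be sketched as follows, assuming that the jump locations have already been detected (the wavelet-transform detection of Ref. [12(b)] is not reproduced here). The function names, the list-based representation of the S profile (one value per 1 kbp window, so that 20 values span 20 kbp) and the sign convention ΔS = S̄(3') - S̄(5') are choices made for this illustration only.

```python
def jump_amplitude(s_values, index, half_width=20):
    """Signed jump Delta S at `index`: mean of S over the `half_width`
    windows on the 3' side minus the mean over the `half_width` windows
    on the 5' side. With 1 kbp windows, half_width=20 spans 20 kbp."""
    if index < half_width or index + half_width > len(s_values):
        raise ValueError("jump too close to the ends of the profile")
    upstream = s_values[index - half_width:index]      # 5' side
    downstream = s_values[index:index + half_width]    # 3' side
    return sum(downstream) / half_width - sum(upstream) / half_width


def count_large_jumps(amplitudes, threshold):
    """Return (number of upward jumps, number of downward jumps) whose
    absolute amplitude exceeds `threshold`, i.e. N(|Delta S| > threshold)
    computed separately for each sign."""
    upward = sum(1 for d in amplitudes if d > threshold)
    downward = sum(1 for d in amplitudes if -d > threshold)
    return upward, downward
```

Evaluating `count_large_jumps` over a range of thresholds gives the two cumulative curves that are compared in a plot such as Fig. 3(b).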

As reported in a previous work, the analysis of a
complete set of human genes revealed that most of them
present TA and GC skews and that these biases are correlated
to each other and are specific to gene sequences.
One can thus wonder to which extent the transcription
machinery can account for the jagged S profiles shown
in Figs 1 and 2. According to the estimates obtained
in Ref. [14], the mean jump amplitudes observed at the
transition between transcribed and non-transcribed regions
are |ΔS_TA| ~ 0.05 and |ΔS_GC| ~ 0.03 respectively.
The characteristic amplitude of a transcription-induced
transition, |ΔS| ≈ 0.08, is thus significantly smaller than
the amplitude ΔS ≈ 0.20 of the main upward jumps
in Fig. 2. Hence, it is possible that, at the transition
between an antisense gene and a sense gene, the overall
jump from negative to positive S values may reach
sizes ΔS ~ 0.16 (twice the single-transition amplitude) that can be comparable to the ones
of the upward jumps in Fig. 2. However, if some
co-orientation of the transcription and replication processes
may account for some of the sharp upward transitions 
in the skew profiles, the systematic observation of "fac- 
tory roof" skew scenery in intergenic regions as well as in 
transcribed regions, strongly suggests that this peculiar 
strand bias is likely to originate from the replication ma- 
chinery. To further examine if intergenic regions present 
typical "factory roof" skew profiles, we report in Fig. 4
the results of the statistical analysis of 287 pairs of putative
adjacent ori that actually correspond to 486 putative
ori almost equally distributed among the 22 autosomal
chromosomes. These putative ori were identified
by (i) selecting pairs of successive upward jumps of amplitude
ΔS > 0.12, and (ii) checking that none of these upward
jumps could be explained by an antisense gene to sense
gene transition.

FIG. 4: Statistical analysis of the skew profiles of the 287
pairs of ori selected as explained in the text. The ori spacing
l was rescaled to 1 prior to computing the mean S values in
windows of width 1/10, excluding from the analysis the first
and last half intervals. (a) Mean S profile (•) over windows
that are more than 90% intergenic. (b) Mean S profile (•)
over windows that are more than 90% genic; the additional
symbols (□ for antisense genes) give the percentage of sense (antisense)
genes located at that position among the 287 putative ori
pairs. (c) Histogram of the slope s of the skew profiles
after rescaling l to 1. (d) Histogram of the mean absolute
deviation of the S profiles from a linear profile.

In Fig. 4(a) is shown the S profile obtained
after rescaling the putative ori spacing l to 1 prior
to computing the average S values in windows of width
1/10 that contain more than 90% of intergenic sequences.
This average profile is linear and crosses zero at the median
position n/l = 1/2, with an overall upward jump
ΔS ~ 0.17. The corresponding average S profile over
windows that are now more than 90% genic is shown in
Fig. 4(b). A similar linear profile is obtained but with
a jump of larger mean amplitude ΔS ~ 0.28. This is a
direct consequence of the gene content of the selected regions.
As shown in Fig. 4(b), sense (antisense) genes are
preferentially on the left (right) side of the 287 selected
sequences, which implies that the replication and (when
present) transcription biases tend to add up. In Fig. 4(c)
is shown the histogram of the linear slope values of the
287 selected skew profiles after rescaling their length to
1. The histogram of the mean absolute deviation from a linearly
decreasing profile reported in Fig. 4(d) confirms the
linearity of each selected skew profile.
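
The rescaling and averaging procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `mean_rescaled_profile`, the input layout (one `(spacing, points)` pair per domain, with each point given as a position measured from the left ori and an S value) are hypothetical, and neither the selection of windows by their intergenic or genic content nor the exclusion of the first and last half intervals is implemented.

```python
def mean_rescaled_profile(domains, n_bins=10):
    """Average S over many domains after rescaling each ori spacing to 1.

    `domains` is a list of (spacing, points) pairs, where `points` is a
    list of (position, s) tuples with 0 <= position <= spacing.
    Returns one mean S value per bin of width 1/n_bins (None if empty).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for spacing, points in domains:
        for position, s in points:
            x = position / spacing          # rescaled position n/l
            if not 0.0 <= x <= 1.0:
                continue                    # ignore points outside the domain
            b = min(int(x * n_bins), n_bins - 1)
            sums[b] += s
            counts[b] += 1
    return [sums[b] / counts[b] if counts[b] else None for b in range(n_bins)]
```

Two domains of different lengths then contribute to the same bins: a point at 5 kbp in a 100 kbp domain and a point at 10 kbp in a 200 kbp domain both fall at n/l = 0.05.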

FIG. 5: A model for replication in the human genome.
(a) Theoretical skew profiles obtained when assuming that
two equally active adjacent ori are located at n/l = 0 and
1, where l is the ori spacing; the 3 profiles in thin, thick
and normal lines correspond to different ter positions. (b)
Theoretical mean S profile obtained by summing step-like
profiles as in (a), under the assumption of a uniform random
positioning of the ter in between the two ori.

Following these observations, we propose in Fig. 5 a
rather crude model for replication that relies on the hypothesis
that the ori are quite well positioned while the
ter are randomly distributed. In other words, replication
would proceed in a bi-directional manner from well defined
initiation positions, whereas the termination would
occur at different positions from cell cycle to cell cycle.
Then, if one assumes that (i) the ori are identically
active and (ii) any location in between two adjacent
ori has an equal probability of being a ter, the continuous
superposition of step-like profiles like those in
Fig. 5(a) leads to the anti-symmetric skew pattern shown
in Fig. 5(b), i.e. a linearly decreasing S profile that crosses
zero at middle distance from the two ori. This model is
in good agreement with the overall properties of the skew
profiles observed in the human genome and supports the
hypothesis that each detected upward jump corresponds
to an ori.
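
The superposition argument can be checked numerically with the short sketch below. It assumes two ori at rescaled positions 0 and 1 and a step-like skew equal to +delta on the ori side at 0 of the ter and -delta on the other side; the amplitude `delta`, the function names and the regular grid of ter positions (used in place of random draws to keep the result deterministic) are assumptions of this illustration. Averaging over uniformly distributed ter positions gives S(x) = delta (1 - 2x), since a fraction 1 - x of the ter lie beyond x.

```python
def step_profile(x, ter, delta=1.0):
    """Skew at rescaled position x for a single ter position: +delta
    between the ori at 0 and the ter, -delta between the ter and the
    ori at 1."""
    return delta if x < ter else -delta


def mean_profile(x, n_ter=1000, delta=1.0):
    """Average of step profiles over n_ter ter positions spread
    uniformly in (0, 1); approximates delta * (1 - 2 * x)."""
    ters = [(k + 0.5) / n_ter for k in range(n_ter)]
    return sum(step_profile(x, t, delta) for t in ters) / n_ter
```

The averaged profile decreases linearly from +delta at the first ori to -delta at the second one and vanishes at x = 1/2, so that the upward jump at each ori has amplitude 2 delta.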

To summarize, we have proposed a simple model for 
replication in the human genome whose key features are 
(i) well positioned ori and (ii) a stochastic positioning of 
the ter. This model predicts jagged skew profiles as ob- 
served around most of the experimentally identified ori 
as well as along the 22 human autosomal chromosomes. 



Using this model as a guide, we have selected 287 do- 
mains delimited by pairs of successive upward jumps in 
the S profile and covering 24% of the genome. The 486 
corresponding jumps are likely to mark 486 ori active in 
the germ line cells. Given the rather large size of
the selected sequences (~ 2 Mbp on the native sequence),
these putative ori are likely to correspond to the large
replicons that require most of the S-phase to be replicated.
Another possibility is that these ori might
correspond to the so-called replication foci observed in
interphase nuclei. These stable structures persist
throughout the cell cycle and subsequent cell generations,
and likely represent a fundamental unit of chromatin organization.
Although the prediction of 486 ori seems a
significant achievement given the very small number
of experimentally identified ori, one can reasonably
hope to do much better relative to the large number
(probably several tens of thousands) of ori. Actually
what makes the analysis quite difficult is the extreme 
variability of the ori spacing from 100 kbp to several Mbp, 
together with the necessity of disentangling the part of 
the strand asymmetry coming from replication from that 
induced by transcription, a task which is rather delicate 
in regions with high gene density. To overcome these 
difficulties, we plan to use the WT with the theoretical
skew profile in Fig. 5(b) as an adapted analyzing wavelet.
The identification of a few thousand putative ori in the 
human genome would be a very promising methodologi- 
cal step towards the study of replication in mammalian 
genomes. 